// createData.do
// This .do file sets file path globals and runs all the do files in this folder in order to create the data, figures, and tables in the paper.
// Date last updated: 1/27/2025

********************************************************************************	
* Set global file paths (change as needed)
********************************************************************************
clear all

// Raw data
	global pathr "YOURPATHHERE\replication\data\raw"
// Intermediate data
	global pathi "YOURPATHHERE\replication\data\intermediate"
// Final data
	global pathf "YOURPATHHERE\replication\data\final"
		
// Data creation code folder
	global pathc "YOURPATHHERE\replication\data_creation_code\sub_do_files"
// Output
	global patho "YOURPATHHERE\replication\output"
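
// Optional sanity check (an addition, not part of the original workflow)
	* A minimal sketch: stop early if any of the folder globals set above points to a folder that does not exist,
	* for example because YOURPATHHERE has not been replaced. Remove this block if it is not wanted.
	foreach g in pathr pathi pathf pathc patho {
		* direxists() is a Mata function that returns 1 if the folder exists and 0 otherwise
		mata: st_local("found", strofreal(direxists(st_global("`g'"))))
		if `found' == 0 {
			display as error "Folder not found for global `g': ${`g'}"
			exit 601
		}
	}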

********************************************************************************	
* Run the data creation files for corelogic_slr.dta
********************************************************************************
// Create CoreLogic property transaction data //

	* This file imports and formats CoreLogic data, which provide separate tax roll and deed records for single-family residential properties in the following East Coast states: CT, DE, FL, GA, MA, MD, ME, NC, NH, NJ, NY, PA, RI, SC, and VA.
	* The data were queried directly from the Federal Reserve Data Warehouse (RADAR) in August 2020.
	
	do "$pathc\coreLogicData.do"

********************************************************************************
// Process Sea Level Rise (SLR) shape files //

	* This file imports processed NOAA sea level rise (SLR) shape files for each state and separately saves the data and coordinates for each foot of SLR (0ft, 1ft, ..., 6ft) as .dta files. 
	* The SLR data were converted from raw .gdb files/folders to shape files using the Python script noaaDataToShp.py, which requires ArcGIS Pro to be installed.
	* The raw NOAA sea level rise vectors come from https://coast.noaa.gov/slrdata/Sea_Level_Rise_Vectors/index.html, originally retrieved in August 2020 for each state listed above. We downloaded the `state'_slr_data_dist.zip files for each state.

	* Run noaaDataToShp.py outside of Stata before running slrShapeFiles.do
	do "$pathc\slrShapeFiles.do"
	
********************************************************************************
// Create distance from coast variable using NOAA data //

	* Each property's distance from the nearest coastline is calculated using its precise coordinates (given in the CoreLogic data) and NOAA's Continually Updated Shoreline Product (https://coast.noaa.gov/digitalcoast/data/cusp.html).
	* The .do file then takes these data by state, appends all states together, and saves the intermediate distance data as distancefull.dta.
	* Additionally, we download the shape files of the US medium resolution shoreline and make a list of coordinates within 1km of the shoreline.
	 
	do "$pathc\distFromCoast.do"
	
********************************************************************************
// Import and format county-level controls //

	* This file cleans and saves .dta files for the following county-level controls:
	* Climate opinions, election results, county income and population, demographics, education level, unemployment rate, test scores, crime, number of new buildings, and previous flood events (see README for details)
	* The .do file also creates a crosswalk from ZIP code to county (for 2014, the year for which we have buyer beliefs) and a crosswalk between county FIPS codes and full county names.
	
	do "$pathc\countyControls.do"
	
********************************************************************************
// Gather data on which loans are conforming to GSEs //

	* This do file compiles county-by-year loan limit information from the Federal Housing Finance Agency (FHFA). 
	* We collect data for 2009-2016 from https://www.fhfa.gov/data/conforming-loan-limit
	
	do "$pathc\conformingLoans.do"
	
********************************************************************************
// Create county-level Gallup poll information for alternate measure of climate belief //

	* This step uses both an R file and a Stata file to clean and prepare data from Gallup's annual environment poll (part of the Gallup Poll Social Series, GPSS) for use as an alternative measure of climate beliefs.
	* Step 1: run the R file, which combines the GPSS data with:
	*   - annual ZIP-level pollution data (PM 2.5) from https://www.earthdata.nasa.gov/data/catalog/sedac-ciesin-sedac-aqdh-pm25o3no2-zipcode-1.00#toc-product-summary
	*   - annual county-level housing price data from the Census
	*   - county-level employment and wage data from the Quarterly Census of Employment and Wages (https://www.bls.gov/web/cewqtr.supp.toc.htm)
	* Step 2: run the Stata file, which cleans the merged dataset and combines it with the population and education data created above. It then uses regressions to impute time-varying county-level estimates of being "worried" about climate change, as not all years are present in the data.
	 
	* Run gallupAlternateBeliefs1.R in R before running gallupAlternateBeliefs2.do
	do "$pathc\gallupAlternateBeliefs2.do"
	
********************************************************************************
// Find First Street ID numbers for CoreLogic properties //

	* This file matches First Street properties (identified by FSID) with our dataset of CoreLogic properties (identified by prop_id_dw). Matching is based on both property coordinates (done in several iterations, as coordinates can be inexact) and street address.
	* The raw First Street data was obtained using an API which is no longer in use; First Street data can now be accessed using RADAR.
	
	do "$pathc\firstStreetID.do"
	
********************************************************************************
// Restrict to within 1km of the coast, merge FS elevation, clean and reshape //

	* This .do file combines many of the above intermediate datasets, as well as other raw First Street property-level data on bare earth elevation, historic flood events, climate adaptations, and sea depth probability. It also reshapes the data to the transaction level, creates bins for elevation and distance to the coast, and consolidates some mortgage variables. It saves intermediate.dta.
	* The raw First Street data was obtained using an API which is no longer in use; First Street data can now be accessed using RADAR.
	* CoreLogic's Climate Risk database now also has bare earth elevation for each property; this was not available at the time our data were created. If replicating, you can download gr_el_used along with the property ID and merge it with the rest of the CoreLogic data.

	do "$pathc\createData1km.do"
	
********************************************************************************
// Determine which properties are in FEMA special flood hazard areas //

	* This .do file uses intermediate.dta to create CSVs of property coordinates, which must then be overlaid on FEMA's flood hazard layer in ArcGIS to obtain the flood zones. The ArcGIS output is then imported and appended into one file that can be merged to the main processed data. The .do file creates a dummy variable that identifies whether a property is in a FEMA flood zone. The output is prop_FEMA_zone.dta.
	* To access the flood hazard layer, go to https://hazards.fema.gov/arcgis/rest/services/public/NFHL/MapServer/28

	do "$pathc\femaZone.do"
	
********************************************************************************
// Combine to make final dataset //

	* This .do file combines many of the above input files into the dataset used for the final specification, corelogic_slr.dta. It cleans the data in the following ways:
		* Restricts to properties <= 1km from the coast
		* Drops properties with sale values below $50,000 or above $10,000,000
		* Flags mortgages without a standard 15- or 30-year term
		* Specifies high-cost counties
		* Manually adds conforming loan limits for years 2000-2008
		* Other small variable changes

	* Outputs: corelogic_slr.dta

	do "$pathc\createCoreLogicSLR.do"
	
********************************************************************************	
* Run the data creation file for GSE_FS.dta and gse_fs_time_series.dta
********************************************************************************
// Merge FS flood factor data to GSE conforming loan balance info at 3-digit ZIP level //

	* This file imports raw property-level First Street data containing each property's Flood Factor (a proprietary measure of flood risk) and aggregates it to the 3-digit ZIP code level, saving the average Flood Factor for each 3-digit ZIP code.
	* It then merges this with GSE data on conforming loans by 3-digit ZIP code, and cleans and reshapes the result into a time series at the loan-quarter level that records whether each loan has defaulted.

	do "$pathc\createGSEfs.do"
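
********************************************************************************
// Optional check that the final datasets exist (an addition, not part of the original workflow) //

	* A minimal sketch under one assumption: the sub do files save corelogic_slr.dta, GSE_FS.dta, and
	* gse_fs_time_series.dta in the final data folder ($pathf). Adjust the folder if they are stored elsewhere.
	* A forward slash is used before `f' because a backslash directly before a macro stops Stata from expanding it.
	foreach f in corelogic_slr GSE_FS gse_fs_time_series {
		capture confirm file "$pathf/`f'.dta"
		if _rc {
			display as error "Expected output not found: $pathf/`f'.dta"
		}
	}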
	